
8 research outputs found

Authorship attribution in Portuguese using character N-grams

Author: Baptista Jorge
Markov Ilia
Pichardo-Lagunas Obdulia
Publication venue: Obuda University
Publication date: 01/01/2017

For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus.

Funding: Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)
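As a rough illustration of the feature this abstract refers to, the sketch below counts overlapping character n-grams in a text. It is not the paper's feature set: the function name `char_ngrams`, the default length `n=3`, and the sample sentence are assumptions made here for illustration, and the paper's typed n-gram categories are not reproduced.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of length n in text.

    Texts shorter than n produce an empty Counter.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical sample sentence; any string works.
profile = char_ngrams("o rato roeu a roupa", n=3)
for gram, count in profile.most_common(3):
    print(repr(gram), count)
```

Profiles like this one can be turned into feature vectors (for example, relative frequencies) before being passed to a classifier; the choice of representation is one of the things the abstract says the paper varies.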

Crossref

Sapientia

Detección automática de primitivos semánticos con algoritmos bioinspirados [Automatic detection of semantic primitives with bio-inspired algorithms]

Author: Pichardo Lagunas Obdulia
Publication venue: Pichardo Lagunas, Obdulia
Publication date: 10/05/2017

Any traditional explanatory dictionary inevitably contains cycles in its definitions: if a word is defined in the dictionary and later used in a definition, there is always a path in the dictionary that returns to the same word. An example of a cycle of length two: "pacto es convenio" (a pact is an agreement), "convenio es tratado" (an agreement is a treaty), "tratado es pacto" (a treaty is a pact); in two steps we return to the same word. In a good dictionary the cycles are long, but they are unavoidable. A computational semantic dictionary (intended for use by computers) cannot contain cycles in its definitions without impairing the logical inference capability of computational systems. We call semantic primitives a set of words whose removal from the dictionary would leave it free of cycles; that is, these words have no definition in the dictionary, and in this sense they are primitive. In this thesis, our goal is to keep as many words as possible in the dictionary while obtaining a minimal number of semantic primitives. We present a method that extracts the smallest set of primitives obtained so far. To do so, we represent the dictionary as a directed graph and apply bio-inspired algorithms that determine the order in which the graph should be built.
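The pacto / convenio / tratado example in this abstract can be sketched as a small directed graph in which each word points to the words used in its definition. The code below is only an illustration of detecting such a cycle with a depth-first search; the function name `find_cycle` and the dictionary layout are assumptions made here, not the thesis's implementation.

```python
def find_cycle(graph):
    """Return one cycle as a list of words (start word repeated at the end),
    or None if the definition graph has no cycle.

    graph maps each defined word to the words used in its definition.
    """
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    state = {word: WHITE for word in graph}
    path = []

    def visit(word):
        state[word] = GREY
        path.append(word)
        for nxt in graph.get(word, ()):
            if state.get(nxt, WHITE) == GREY:
                # nxt is already on the current path: a cycle is closed.
                return path[path.index(nxt):] + [nxt]
            if state.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[word] = BLACK
        return None

    for word in graph:
        if state[word] == WHITE:
            cycle = visit(word)
            if cycle:
                return cycle
    return None


# Toy dictionary taken from the example in the abstract.
definitions = {"pacto": ["convenio"], "convenio": ["tratado"], "tratado": ["pacto"]}
print(find_cycle(definitions))

# Treating "pacto" as a primitive (no definition) breaks the cycle.
with_primitive = dict(definitions, pacto=[])
print(find_cycle(with_primitive))
```

Leaving one word undefined is enough to break this toy cycle, which is the sense in which a primitive is a word without a definition.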

Red Mexicana de Repositorios Institucionales

Representación computacional de la escritura maya [Computational representation of Maya writing]

Author: Pichardo Lagunas Obdulia
Publication venue: Instituto Politécnico Nacional (IPN)
Publication date: 17/12/2018

Thesis (Master of Science in Computer Science), Instituto Politécnico Nacional, CIC, 2008, 1 PDF file (88 pages). tesis.ipn.m

Red Mexicana de Repositorios Institucionales

Advances in Soft Computing: 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Cancún, Mexico, October 23-28, 2016, Proceedings, Part II

Author: Miranda-Jiménez Sabino
Pichardo-Lagunas Obdulia
Publication venue: Springer International Publishing AG
Publication date: 01/01/2017

CERN Document Server

Automatic detection of semantic primitives using optimization based on genetic algorithm

Author: Anton Malandii
Grigori Sidorov
Obdulia Pichardo-Lagunas
Yevhen Kostiuk
Publication venue: PeerJ
Publication date: 01/04/2023

In this article, we propose a method for the automatic retrieval of a set of semantic primitive words from an explanatory dictionary and a novel evaluation procedure for the obtained set of primitives. The approach represents the dictionary as a directed graph and poses a single-objective constrained optimization problem, solved by a genetic algorithm with the PageRank scoring model. The problem is defined as subset selection. The algorithm searches for sets of words that fulfil several requirements: the cardinality of the set should not exceed empirically selected limits, and the PageRank word importance score is minimized with cycle prevention thresholding. In the experiments, we used the WordNet dictionary for English. The proposed method is an improvement over the previous state-of-the-art solutions.
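To make the PageRank word importance score mentioned in this abstract more concrete, here is a minimal power-iteration sketch on a small definition graph. It illustrates standard PageRank only; the edge direction (a word points to the words used in its definition), the damping factor of 0.85, the iteration count, and the toy words are all assumptions made here, and the genetic algorithm and thresholding described in the article are not shown.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    graph maps each node to a list of nodes it links to. Nodes that appear
    only as link targets are included with no outgoing edges.
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no outgoing edges) is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u, targets in graph.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "thing" is used in every other definition.
toy = {"pact": ["agreement", "thing"], "agreement": ["thing"], "thing": []}
scores = pagerank(toy)
print(sorted(scores, key=scores.get, reverse=True))
```

Under the assumed edge direction, words that many definitions rely on accumulate a higher score; how the article combines this score with the cardinality limits is described only at the level of the abstract.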

Directory of Open Access Journals

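Detección automática de primitivas semánticas en diccionarios explicativos con algoritmos bioinspirados [Automatic detection of semantic primitives in explanatory dictionaries with bio-inspired algorithms]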

Author: Cruz Cortés Nareli
Gelbukh Alexander F.
Pichardo Lagunas Obdulia
Sidorov Grigori
Publication date: 01/01/2014

Inevitably, any explanatory dictionary contains cycles in its definitions; that is, if a word is defined in the dictionary and then used in a definition, there is always a path in the dictionary that returns to the same word. In a good dictionary the cycles are long, but they are unavoidable. A computational dictionary cannot contain cycles in its definitions without impairing the logical inference ability of computer systems. In this study, we call semantic primitives those words whose removal from the dictionary eliminates the cycles; that is, those words have no definition and, in this sense, are primitive. Our goal is to keep as many words as possible in the dictionary, i.e., to minimize the number of semantic primitives. We present a method that achieves the smallest set of primitives obtained so far. To accomplish this, the dictionary is represented as a directed graph, and a differential evolution algorithm determines the order in which the graph should be built.
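One way to read the idea that the build order of the graph matters is the greedy sketch below: words are added one at a time, and a word becomes a primitive whenever adding its definition would close a cycle. This is an assumption about how an ordering could be used, not the paper's algorithm; the names `reaches` and `primitives_for_order` and the toy dictionary are hypothetical, and the differential evolution search over orderings is not shown.

```python
def reaches(graph, start, goal):
    """Return True if goal can be reached from start by following edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def primitives_for_order(definitions, order):
    """Insert words in the given order; a word whose definition would close
    a cycle is left undefined and reported as a primitive."""
    graph = {}        # word -> words used in its kept definition
    primitives = []
    for word in order:
        targets = definitions.get(word, [])
        # Edges word -> target close a cycle if word is reachable from a target.
        if any(reaches(graph, target, word) for target in targets):
            primitives.append(word)
        else:
            graph[word] = list(targets)
    return primitives


# Hypothetical toy dictionary with one cycle.
toy = {"pact": ["agreement"], "agreement": ["treaty"], "treaty": ["pact"]}
print(primitives_for_order(toy, ["pact", "agreement", "treaty"]))
print(primitives_for_order(toy, ["treaty", "pact", "agreement"]))
```

Different orders can yield different primitive sets, which is why a search procedure over orderings can change how many primitives are needed.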

DIALNET

Unified, Labeled, and Semi-Structured Database of Pre-Processed Mexican Laws

Author: Equihua Miguel
Hernández-Huerta Arturo
Koff Harlan
Martinez-Seis Bella
Perez-Maqueo Octavio
Pichardo-Lagunas Obdulia
Publication date: 01/06/2022

This paper presents a corpus of pre-processed Mexican laws for computational tasks. The main contributions are the proposed JSON structure and the methodology used to achieve the semi-structured corpus with the selected algorithms. Law PDF documents were transformed into plain text, unified by a deconstruction of law–document structure, and labeled with natural language processing techniques considering part of speech (PoS); a process of entity extraction was also performed. The corpus includes the Mexican constitution and the Mexican laws that were collected from the official site in PDF format repealed before 14 October 2021. The collection has 305 documents, including: the Mexican constitution, 289 laws, 8 federal codes, 3 regulations, 2 statutes, 1 decree, and 1 ordinance. The semi-structured database includes the transformation of the set of laws from PDF format to a digital representation in order to facilitate its computational analysis. The documents were migrated to JSON files to represent internal hierarchical relations. In addition, basic natural language processing techniques were applied to the laws to identify parts of speech and named entities. The presented data set is mainly useful for text analysis and data science. It could be used for various legislative analysis tasks, including comprehension, interpretation, translation, classification, accessibility, coherence, and searches. Finally, we present some statistics of the identified entities and an example of the usefulness of the corpus for environmental laws.

Open Repository and Bibliography - Luxembourg